Abstract

Inter-coder agreement measures, like Cohen's $\kappa$, correct the relative frequency
of agreement between coders to account for agreement which simply occurs by
chance. However, in some situations these measures exhibit behavior which makes
their values difficult to interpret. These properties, e.g. the "annotator bias" or the
"problem of prevalence", refer to a tendency of some of these measures to indicate
counterintuitively high or low values of reliability depending on conditions which
many researchers consider unrelated to inter-coder reliability. However, not all
researchers agree with this view, and since there is no commonly accepted formal 
definition of inter-coder reliability, it is hard to decide whether this depends upon 
a different concept of reliability or simply upon flaws in the measuring algorithms. 

In this note we therefore take an axiomatic approach: we introduce a model 
for the rating of items by several coders according to a nominal scale. Based upon 
this model we define inter-coder reliability as a probability to assign a category 
to an item with certainty. We then discuss under which conditions this notion of 
inter-coder reliability is uniquely determined given typical experimental results, 
i.e. relative frequencies of category assignments by different coders. 

In addition we provide an algorithm and conduct numerical simulations which 
exhibit the accuracy of this algorithm under different model parameter settings. 



1 Introduction 

Measuring the agreement between the nominal ratings of a set of items by several 
coders or judges is a common task in a number of disciplines like medical, psycho- 
logical, and social sciences, content analysis and marketing. Simply measuring the 
percentage of agreement is not adequate as it does not take into account agreement 
which simply occurs by chance. A number of inter-coder reliability measures have been
proposed to cope with this effect, the most prominent being $\kappa$, $\pi$, $\alpha$, and $S$.

These measures are defined as ratios of chance-corrected numbers of observed 
agreement vs. maximal agreement and differ in the way the chance-correction is taken 
into account. The ways in which these corrections are computed have given rise to some criticism of
these measures, because they "favor" or "penalize" certain coder behaviors, which
some researchers consider inappropriate.

Though usually not explicitly stated (cf. also p. 294]), the basic assumption 
is that a coder either assigns a category by certainty resp. "expert judgment" (Brennan 
and Prediger, [4, p. 689]) or assigns some category without being absolutely sure about 
his or her choice. Obviously, it is not possible to identify for an individual assignment
whether it was made with certainty or not; sometimes not even the rater
himself or herself may be sure about the exact reasons for his or her choice.

At one extreme is the S-value, which assumes a uniform distribution of categories
when "chance assignments" occur. Scott's $\pi$ and Cohen's $\kappa$ on the other hand
use "marginal distributions", i.e. the overall distribution of category assignments by 
each rater, to correct for chance agreement. Using marginal distributions may lead 
to incorrect chance correction since these distributions also include assignments made 
by certainty and thus may also be more than marginally influenced by the distribu- 
tion of categories according to the population of items. Using uniform distribution on 
the other hand may underestimate chance agreement if there are categories that coders 
hardly ever choose. There exists a considerable literature on this subject, see e.g. [2] and [9].

Obviously it is hard to reach a consensus about which strategy a coder will follow
in general when category assignment is not done by certainty. In our model we thus 
will not presume a certain distribution to account for chance agreement. 

Cohen's $\kappa$ exhibits a feature, usually called "annotator bias", which describes the
fact that $\kappa$ yields higher values when coders produce widely diverging marginal
distributions than when the marginal distributions are similar. See [2, section 3.1], who
support this feature, [6] for criticism, and [13] for a formal proof. Scott's $\pi$, in
contrast, uses the common marginal distribution of the coders and so "favors" coders
that produce similar marginal distributions.

In order to measure inter-coder reliability (in contrast to intra-coder, i.e. test-retest 
reliability) it is necessary that the experiment can be reproduced when conducted in the 
same way with another group of coders (which of course may be restricted to a certain 
base population e.g. trained in some way, but not delimited to some particular individu- 
als). So an inter-coder reliability measure should (approximately) yield the same value 
for every sufficiently large subset of coders from the prescribed population of coders 
and the coders' marginal distributions may vary according to some distribution which 
depends on the population of coders. 

Another debated fact is the prevalence problem, referring to the fact that some of
these measures ($\kappa$, $\pi$, $\alpha$) produce low scores when one category is predominant among
the ratings.

There is some debate on this issue. While [6], [7] and [9] consider this a weakness,
it is justified by Artstein and Poesio with the argument that "reliability in such
cases is the ability to agree on rare categories" (section 3.2). This latter argument
is somewhat problematic for statistical measures, which usually are designed to exhibit
typical, not exceptional, behavior. In our model we will take an approach which defines
reliability as a property common to the category assignments and independent of the 
relative frequency of the ("true" or "correct") items' categories. However it will turn 
out that reliability can only be determined if not all items belong to one category. 
The approach we take differs from these measures as we start with an axiomatic
model-based definition of inter-coder reliability, which will be a probability of some 
event. This has the nice side effect that the value of the reliability parameter can be 
stated as a probability of an idealized coder's behavior and thus has a direct interpreta- 
tion. 

In addition, by basing the definition of inter-coder reliability upon such a model, one
may simulate coder ratings with a known reliability parameter and thus evaluate
the accuracy of algorithms under different setups. We will do this in Section 4 for the
algorithm we provide.

Though the author believes that the model used here is fairly general, there might 
be situations in which it could be deemed unfeasible. Here the explicit statement of 
the model's assumptions helps to determine whether the model is acceptable in an 
experiment or not. We will take a closer look at some of the assumptions of the model 
and their possible impact on reliability results at the end of the next section. 

2 The Model 

We denote by $C = \{c_1, \dots, c_m\}$ the (finite) set of $m$ categories, into which $N$ items,
$\mathbb{N}_N$, are to be classified by the $R$ raters, $\mathbb{N}_R$. We use $\mathbb{N}_n$ to denote the natural numbers
$\{1, \dots, n\}$.

The common assumptions for inter-rater agreement are (rephrased from [5]):

(i) The items are independent.

(ii) The categories are independent, mutually exclusive, and exhaustive.

(iii) The raters operate independently.

Assumption (ii), that categories are exhaustive and mutually exclusive, implies that
for every item there is one and only one "correct" category. In other words, assumption
(ii) above implies the existence of a (usually unknown) function

$$\gamma : \mathbb{N}_N \to C.$$

We will sometimes call $\gamma(k)$ the "true" category associated with item $k$, without any
philosophical implication of the term "true".

For each $c \in C$ let $N_c := \#\gamma^{-1}(c)$ denote the number of items whose true category
is $c$, and write $\tau_c := N_c / N$ for the relative frequency of these items.

If a coder rates an item he or she may either be sure about the category to be chosen 
or not. If the coder is sure about the item's category it seems natural to assume that the 
coder will assign this category to the item (so we assume that the coders will not cheat 
but will assign a category to the best of their knowledge). 

Now, what happens in the case the coder is not completely sure about the category 
to assign? In this case, considering a large set of such items, we will observe a certain 
relative frequency for the categories to be chosen. In general it is hard to know which 
strategy the coder will take and this is frequently debated in the context of Cohen's
$\kappa$. Coders might follow some "base rate", i.e. be guided by some assumption about
the distribution of categories in the population of items, or may choose the category
according to a uniform distribution on the set of categories (cf. e.g. [11]).

There are certainly good reasons for many of these assumptions and it is probably 
also dependent upon the field of research (e.g. medical diagnosis vs. speech analysis), 
upon the kinds of items, the professional background and education of the raters (e.g. 
scholars vs. laymen) and many more properties. Hence we will not assume any partic- 
ular distribution but only assume that such a distribution exists. 

To formalize, we thus assume that given an item $k$ a rater recognizes the true category
$\gamma(k)$ with a probability $\beta$. If the coder fails to recognize it, he or she assigns a
"random" category with some unknown distribution. Assumption (i) above suggests
modeling these actions by independent random variables.

So formally let $Z_k$ be $\{0,1\}$-valued and $Y_k$ be $C$-valued independent random variables,
$k \in \mathbb{N}_N$, and assume that both families are identically distributed.

We define a coder's rating of item $k \in \mathbb{N}_N$ by the outcome of the random variable
$X_k$ given by

$$X_k := \begin{cases} \gamma(k), & \text{if } Z_k = 0 \\ Y_k, & \text{if } Z_k = 1. \end{cases} \tag{1}$$

We let $\beta := \mathbb{P}(Z_k = 0)$ and $p_c := \mathbb{P}(Y_k = c)$.
It is immediate from the definition that

$$\mathbb{P}(X_k = c) = \beta\,\delta_{c,\gamma(k)} + (1-\beta)\,p_c, \tag{2}$$

so the distribution of $X_k$ is a mixture of the atomic distribution at $\gamma(k)$ and
$p = (\mathbb{P}(Y_k = c))_{c \in C}$ with mixture parameter $\beta$. (Here $\delta$ is Kronecker's delta, i.e. $\delta_{x,y} = 1$ if
$x = y$ and $0$ otherwise.)

For convenience let us call this model the coder model with parameters $(\beta, \gamma, p)$,
where $p = (p_c)_{c \in C}$. Throughout this note we will tacitly let $\mathbb{N}_N$ denote the domain and
$C$ the codomain of $\gamma$. A family of independent $C$-valued random variables $(X_k)_{k \in \mathbb{N}_N}$
which satisfies (2) is called a coder process for the coder model.

According to assumption (iii) above, several coders are modeled by independent
families $(X_{i,k})_{k \in \mathbb{N}_N}$, where the subscript $i$ refers to the coder.
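
The coder model can be simulated directly from definition (1). The following is a minimal sketch, not part of the original note: the function name `simulate_ratings`, the encoding of categories as integer indices, and the `seed` argument are assumptions made for illustration.

```python
import random

def simulate_ratings(beta, gamma, p, n_coders, seed=0):
    """Draw ratings X[i][k] for coder i and item k from the coder model.

    beta  -- reliability parameter P(Z_k = 0)
    gamma -- list with the true category index of every item
    p     -- list of a priori probabilities p_c (assumed to sum to 1)
    """
    rng = random.Random(seed)
    categories = list(range(len(p)))
    ratings = []
    for _ in range(n_coders):
        row = []
        for true_cat in gamma:
            if rng.random() < beta:      # Z_k = 0: the true category is recognized
                row.append(true_cat)
            else:                        # Z_k = 1: draw Y_k from the distribution p
                row.append(rng.choices(categories, weights=p)[0])
        ratings.append(row)
    return ratings
```

Each coder gets an independent row, reflecting assumption (iii); all coders share the same $\beta$ and $p$, as the model prescribes.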

If a rater chooses to assign category $c$ to an item $k$ he or she may either be certain
about the item's category or may be uncertain and assign $c$ by chance only. Gwet [10,
section 4] uses this same interpretation of the rating process. In our model certainty
occurs when the coder chooses the category according to $\gamma(k)$, i.e. when $Z_k = 0$. So
it seems reasonable to use the probability $\beta = \mathbb{P}(Z_k = 0)$ as agreement indicator; let
us call it the reliability parameter of the coder model. Of course an assignment to
category $\gamma(k)$ also occurs when $Y_k = \gamma(k)$, which happens with probability $p_{\gamma(k)}$.

Aickin [1] also used a mixture model to study inter-coder reliability. In our notation
the mixture distribution in Aickin's model is the distribution

$$(c_1, c_2) \mapsto \mathbb{E}\Big(\tfrac{1}{N}\#\{k \in \mathbb{N}_N : X_{1,k} = c_1\} \cdot \tfrac{1}{N}\#\{k \in \mathbb{N}_N : X_{2,k} = c_2\}\Big)$$

which is not a mixture distribution in our model, so our model is different from
Aickin's.
There are three features of this model which may need a second look.

The first feature is that coders are modeled by identically distributed families of
random variables. This might seem oversimplifying since generally every coder may
have his or her own preference. Actually this feature touches the controversy about
"annotator bias".

As we already discussed in the introduction, inter-coder reliability, in contrast to
intra-coder reliability, is only present if the experiment can be reproduced with different
coders from some coder population. Our model uses the parameters $\beta$ and $p$ to
characterize the coder population.

The second feature that may deserve closer consideration concerns the
a priori distribution $p$ being independent of the item in question. Often one
may arrive at the situation where for a particular item the coders may easily rule out
some categories but are doubtful about some others. In this situation assuming that
the a priori distribution is the same for all items is indeed oversimplifying. Without
this assumption, however, the model would be completely useless. Indeed, if $p$ were
dependent on the item $k$ we could simply set $p^{(k)}$ to the distribution of categories
obtained for this item and find that every outcome could be obtained with reliability
parameter $\beta = 0$, i.e. by pure randomness.

If in some experimental setup the independence of p on the item would be deemed 
a relevant issue, it would be advisable to split the set of items into subsets such that the 
a priori distribution could be considered the same for all items in each of the subsets. 

The third feature which deserves attention is that the probability to identify item $k$
as belonging to category $\gamma(k)$ is independent of $\gamma(k)$, i.e. that $\beta$ is considered independent
of $c$. It is easy to imagine a situation where the categories of some subset are more
easily distinguished from each other than those of another subset. In this situation it would
indeed be more appropriate to assume $\beta$ to be dependent on $c$. However, this would
entail the necessity to report several values as reliability parameter, which would make
comparisons more difficult.

Here, too, one should cope with this feature by a careful design of the experiment
(choice of categories). We will return to this aspect later (following Proposition 5).

3 Inter-Coder Agreement

According to our model inter-coder reliability is the parameter $\beta$ in (2) which, since
$\gamma$ is unknown, is not directly observable in experiments. In experiments only relative
frequencies of category assignments can be observed, i.e. we can observe
$\frac{1}{N}\#\{k \in \mathbb{N}_N : X_{1,k} = c_1, \dots, X_{r,k} = c_r\}$ or, idealized, its expectation value. In the present
section we will discuss under which conditions $\beta$ can be uniquely determined from these
expectation values.

Throughout this section we will frequently use the following relations, the proof of
which is obvious from (2) and the independence of the $X_{i,k}$.
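$$e_{1,c} := \mathbb{E}\Big(\tfrac{1}{N}\#\{k \in \mathbb{N}_N : X_k = c\}\Big) = \beta\tau_c + (1-\beta)p_c \tag{3}$$

$$e_{2,c_1,c_2} := \mathbb{E}\Big(\tfrac{1}{N}\#\{k \in \mathbb{N}_N : X_{1,k} = c_1, X_{2,k} = c_2\}\Big)
= \beta^2\delta_{c_1,c_2}\tau_{c_1} + \beta(1-\beta)(\tau_{c_1}p_{c_2} + \tau_{c_2}p_{c_1}) + (1-\beta)^2 p_{c_1}p_{c_2} \tag{4}$$

$$e_{2,c} := e_{2,c,c} = \beta^2\tau_c + 2\beta(1-\beta)\tau_c p_c + (1-\beta)^2 p_c^2 \tag{5}$$

$$e_{3,c} := \mathbb{E}\Big(\tfrac{1}{N}\#\{k \in \mathbb{N}_N : X_{1,k} = X_{2,k} = X_{3,k} = c\}\Big)
= \beta^3\tau_c + 3\beta^2(1-\beta)\tau_c p_c + 3\beta(1-\beta)^2\tau_c p_c^2 + (1-\beta)^3 p_c^3 \tag{6}$$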
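
Relations (3), (5) and (6) can be evaluated numerically by conditioning on whether an item's true category equals $c$. The sketch below is an illustration only; the function name `expectations` and the list-based encoding of $\tau$ and $p$ are assumptions.

```python
def expectations(beta, tau, p):
    """Return (e1, e2, e3): lists indexed by category, following (3), (5), (6)."""
    e1, e2, e3 = [], [], []
    for t, q in zip(tau, p):
        hit = beta + (1 - beta) * q    # P(X_k = c) when gamma(k) = c
        miss = (1 - beta) * q          # P(X_k = c) when gamma(k) != c
        e1.append(t * hit + (1 - t) * miss)
        e2.append(t * hit ** 2 + (1 - t) * miss ** 2)
        e3.append(t * hit ** 3 + (1 - t) * miss ** 3)
    return e1, e2, e3
```

Expanding the powers of `hit` and `miss` reproduces the right-hand sides of (3), (5) and (6) term by term.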

Our first result shows that it is not always possible to identify $\beta$ from the coders'
ratings.

Proposition 1 Let $(\beta, \gamma, p)$ be a coder model and assume that there is $c_0 \in C$ such that
$\gamma(k) = c_0$ for all $k \in \mathbb{N}_N$. Then for every $\beta' \le \beta + (1-\beta)p_{c_0}$ there is a coder model
$(\beta', \gamma, p')$ such that

$$\beta\,\delta_{c,\gamma(k)} + (1-\beta)p_c = \beta'\,\delta_{c,\gamma(k)} + (1-\beta')p'_c \tag{7}$$

for all $c \in C$, $k \in \mathbb{N}_N$.

Proof Given $\beta' \le \beta + (1-\beta)p_{c_0}$, we only have to show the existence of a vector
$p' \in [0,1]^m$ with $\sum_{c \in C} p'_c = 1$ such that (7) holds.

Assume first that $\beta' = 1$. Then $1 = \beta' \le \beta + (1-\beta)p_{c_0} \le 1$, so

$$(1-\beta)(p_{c_0} - 1) = 0. \tag{8}$$

Hence either $\beta = 1$ or $p_{c_0} = 1$. In the first case the statement is trivially satisfied and
in the second case we may set $p'_c = p_c$ for all $c \in C$ and obtain either $1$ (if $c = c_0$) or
$0$ on both sides of (7), proving the statement in this case.
Now assume $\beta' < 1$. Then

$$\beta' \le \beta + (1-\beta)p_{c_0} = \beta + (1-\beta)\Big(1 - \sum_{c \neq c_0} p_c\Big) \le 1 - (1-\beta)p_c$$

for all $c \in C \setminus \{c_0\}$. Hence $(1-\beta)p_c \le 1 - \beta'$, so defining

$$p'_c := \frac{1-\beta}{1-\beta'}\,p_c$$

we obtain $p'_c \in [0,1]$ for $c \neq c_0$. Also define

$$p'_{c_0} := \frac{\beta - \beta' + (1-\beta)p_{c_0}}{1-\beta'}.$$

Then $p'_{c_0} \ge 0$ by the condition on $\beta'$, and $\beta - \beta' + (1-\beta)p_{c_0} \le \beta - \beta' + (1-\beta) = 1 - \beta'$
shows $p'_{c_0} \le 1$. Finally,

$$\sum_{c \in C} p'_c = \sum_{c \neq c_0} p'_c + p'_{c_0} = \frac{1-\beta}{1-\beta'}(1 - p_{c_0}) + \frac{\beta-\beta'}{1-\beta'} + \frac{1-\beta}{1-\beta'}\,p_{c_0} = 1$$

completes the proof. □
Note that in Proposition 1 we may always choose $\beta' = 0$, so the rating cannot be
distinguished from a completely random one, but of course at the cost of a distribution
$p'$ possibly far from uniform. In the case $\beta = 1$, $\beta' = 0$ the distribution $p'$ is atomic at
$c_0$, which somewhat challenges the intuition of "random agreement".

On the other hand, unless we know that $\#\gamma(\mathbb{N}_N) > 1$, we are actually unable to
determine the reliability parameter $\beta$.
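
The construction in the proof of Proposition 1 is easy to carry out numerically. The following sketch follows the two cases of the proof; the function name `equivalent_model` and the index-based encoding of categories are assumptions made for illustration.

```python
def equivalent_model(beta, p, c0, beta_new):
    """Return p' such that (beta_new, p') yields the same rating distribution
    as (beta, p) when every item has true category c0 (Proposition 1)."""
    if beta_new > beta + (1 - beta) * p[c0]:
        raise ValueError("beta_new exceeds beta + (1 - beta) * p[c0]")
    if beta_new == 1:
        return list(p)   # only reachable if beta == 1 or p[c0] == 1
    scale = (1 - beta) / (1 - beta_new)
    p_new = [scale * q for q in p]       # p'_c for c != c0
    p_new[c0] = (beta - beta_new + (1 - beta) * p[c0]) / (1 - beta_new)
    return p_new
```

For instance, with `beta_new = 0` the returned `p_new` is simply the observable rating distribution itself, which illustrates why $\beta$ cannot be identified when all items share one true category.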

Proposition 2 Let $(\beta, \gamma, p)$ be a coder model and assume that $\tau_c < 1$ for all $c \in C$.
Then the following holds:

(i) If $e_{2,c_0} = e_{1,c_0}^2$ for some $c_0$ then either $\tau_{c_0} = 0$ or $\beta = 0$.
(ii) $e_{2,c} = e_{1,c}^2$ for all $c \in C$ if and only if $\beta = 0$.
(iii) If $e_{2,c_0} \neq e_{1,c_0}^2$ for some $c_0$ then $e_{2,c_0} > e_{1,c_0}^2$ and

$$\beta = \sqrt{\frac{e_{2,c_0} - e_{1,c_0}^2}{\tau_{c_0}(1-\tau_{c_0})}} \tag{9}$$

and $p_{c_0}$ is given by

$$p_{c_0} = \frac{e_{1,c_0} - \beta\tau_{c_0}}{1-\beta}. \tag{10}$$

Proof From (3) and (5), for any $c \in C$,

$$e_{1,c}^2 = \beta^2\tau_c^2 + 2\beta(1-\beta)\tau_c p_c + (1-\beta)^2 p_c^2$$

so

$$e_{2,c} - e_{1,c}^2 = \beta^2\tau_c(1-\tau_c). \tag{11}$$

Since by assumption $\tau_c < 1$, equation (11) shows part (i). And since $\sum_{c \in C}\tau_c = 1$ there
is $0 < \tau_{c_0} < 1$ for some $c_0 \in C$, proving part (ii). Solving (11) for $\beta$ and (3) for $p_{c_0}$
completes the proof. □
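
As a worked illustration of part (iii), the following sketch solves (11) for $\beta$ and (3) for $p_{c_0}$ when the relative frequency $\tau_{c_0}$ of one true category is known. The function name `beta_from_known_tau` is an assumption, and the clamp at zero is an addition meant for estimated (noisy) inputs.

```python
import math

def beta_from_known_tau(e1_c, e2_c, tau_c):
    """Solve (11) for beta and (3) for p_c, given 0 < tau_c < 1."""
    # max(0, .) guards against slightly negative differences in estimated data
    beta = math.sqrt(max(0.0, e2_c - e1_c ** 2) / (tau_c * (1 - tau_c)))
    # (3) can only be solved for p_c if beta < 1
    p_c = (e1_c - beta * tau_c) / (1 - beta) if beta < 1 else None
    return beta, p_c
```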
One application of this result is that one may determine the range of the distribution
$p_c$ from the results of a pre-study with a carefully chosen set of items with known "true"
categories which meet the assumption $0 < \tau_c < 1$ for all $c \in C$. Once we know the range,
i.e. $\min_{c \in C}(p_c)$ and $\max_{c \in C}(p_c)$, of the a priori distribution and can reasonably assume
that it does not change for an arbitrary set of items, we can estimate the reliability
parameter for an arbitrary distribution $\tau$ of true categories using the following
Proposition 3 Let $(\beta, \gamma, p)$ be a coder model, assume that $\pi_0 \le p_c \le \pi_1 < 1$ for $c \in C$
and define $e_2 := \sum_{c \in C} e_{2,c}$. Then the following estimate holds:

$$\sqrt{\frac{\max\{0, e_2 - \pi_1\}}{1-\pi_1}} \le \beta \le \sqrt{\frac{e_2 - \pi_0}{1-\pi_0}}$$

Proof From (5) we see that

$$e_2 = \sum_{c \in C}\big(\beta^2\tau_c + 2\beta(1-\beta)\tau_c p_c + (1-\beta)^2 p_c^2\big) \tag{12}$$

$$= \beta^2 + 2\beta(1-\beta)\sum_{c \in C}\tau_c p_c + (1-\beta)^2\sum_{c \in C} p_c^2 \tag{13}$$

Now, since $0 \le \tau_c$ we may estimate $\pi_0\tau_c \le p_c\tau_c \le \pi_1\tau_c$ and $\pi_0 p_c \le p_c^2 \le \pi_1 p_c$, and
since $\sum_{c \in C}\tau_c = 1 = \sum_{c \in C} p_c$ we thus obtain

$$\pi_0 = \sum_{c \in C}\pi_0\tau_c \le \sum_{c \in C}\tau_c p_c \le \pi_1, \text{ and} \tag{14}$$

$$\pi_0 \le \sum_{c \in C} p_c^2 \le \pi_1. \tag{15}$$

Thus we may estimate $e_2$ from below

$$e_2 \ge \beta^2 + 2\beta(1-\beta)\pi_0 + (1-\beta)^2\pi_0 = (1-\pi_0)\beta^2 + \pi_0 \tag{16}$$

(which implies $e_2 - \pi_0 \ge 0$) and similarly from above (replacing $\pi_0$ by $\pi_1$). Since by
assumption $1 - \pi_1 > 0$ and $1 - \pi_0 > 0$ we obtain the desired estimates. □
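
The estimate of Proposition 3 translates into a two-line computation. This is a sketch under the proposition's assumptions; the function name `beta_bounds` is hypothetical, and the clamp in the upper bound is an addition for estimated inputs (for exact expectation values $e_2 - \pi_0 \ge 0$ always holds).

```python
import math

def beta_bounds(e2_total, pi0, pi1):
    """Bounds of Proposition 3; e2_total is the sum of e_{2,c} over all c.

    Requires pi0 <= p_c <= pi1 < 1 for all categories c.
    """
    lower = math.sqrt(max(0.0, e2_total - pi1) / (1 - pi1))
    upper = math.sqrt(max(0.0, e2_total - pi0) / (1 - pi0))
    return lower, upper
```

When `pi0 == pi1` (uniform a priori distribution) both bounds coincide and $\beta$ is determined exactly.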

Observe that the preceding proposition does not assume anything about $\tau_c$; it even
holds if $\tau_{c_0} = 1$ for some $c_0$.

If $p_c$ is the uniform distribution we may put $\pi_0 = \pi_1 = \frac{1}{m}$ in the preceding result and
obtain the equality $\beta^2 = \frac{e_2 - 1/m}{1 - 1/m}$, which is the S-value of Bennett, Alpert and Goldstein.

The S-value has been criticized by Scott [14] on the grounds that it could be increased by adding
spurious categories which would never or hardly ever be used. But, as Scott also notes,
such a modification would contradict the assumption of uniform distribution for $p_c$,
hence after such a modification $\beta$ can no longer be determined by the S-value formula.
We may however use the preceding proposition to obtain estimates for $\beta$: if one adds
a category $c_0$ that a coder wouldn't use, the a priori probability $p_{c_0}$ is $0$ and so is the
minimum, hence we would obtain the inequality

$$\sqrt{\frac{\max\{0, e_2 - \pi_1\}}{1-\pi_1}} \le \beta \le \sqrt{e_2}.$$

Since $\sqrt{(e_2 - \pi_0)/(1-\pi_0)} \le \sqrt{e_2}$, adding such a spurious category results in a larger possible
interval for $\beta$, i.e. a worse estimate.

If we even know the distribution $p$ we are able to compute $\beta$ exactly. The same is
true if the a priori probability $p_c$ matches the item category distribution $\tau_c$. This is the
content of the following
Proposition 4 Let $(\beta, \gamma, p)$ be a coder model and assume that $\tau_c < 1$ for all $c \in C$.

(i) If $p$ is known, $\beta$ can be computed as

$$\beta = \begin{cases}
0, & \text{if } p_c = 1 \text{ and } e_{1,c} = 1 \\[4pt]
1 - \dfrac{e_{1,c} - e_{2,c}}{1 - e_{1,c}}, & \text{if } p_c = 1 \text{ and } e_{1,c} < 1 \\[8pt]
1 - \dfrac{1}{2}\left(\dfrac{1-e_{1,c}}{1-p_c} + \dfrac{e_{1,c}}{p_c}\right) + \sqrt{\dfrac{1}{4}\left(\dfrac{1-e_{1,c}}{1-p_c} + \dfrac{e_{1,c}}{p_c}\right)^2 - \dfrac{e_{1,c} - e_{2,c}}{p_c(1-p_c)}}, & \text{if } 0 < p_c < 1
\end{cases} \tag{17}$$

for any $c \in C$ with $p_c > 0$.

(ii) If $0 < \tau_c = p_c < 1$ for some $c \in C$, then

$$\beta = \sqrt{\frac{e_{2,c} - e_{1,c}^2}{e_{1,c}(1 - e_{1,c})}} \tag{18}$$

Proof From (3) and (5) we obtain

$$e_{2,c} = (\beta + 2(1-\beta)p_c)(e_{1,c} - (1-\beta)p_c) + (1-\beta)^2 p_c^2$$

$$= (\beta-1)^2 p_c(1-p_c) + (\beta-1)(e_{1,c} + p_c - 2p_c e_{1,c}) + e_{1,c}$$

hence

$$f(\beta) := (\beta-1)^2 p_c(1-p_c) + (\beta-1)\big(p_c(1-e_{1,c}) + e_{1,c}(1-p_c)\big) + (e_{1,c} - e_{2,c}) = 0. \tag{19}$$

First observe that $e_{1,c} - e_{2,c} = \sum_{c' \in C} e_{2,c,c'} - e_{2,c} \ge e_{2,c,c} - e_{2,c} = 0$. Since $\sum_{c \in C} p_c = 1$ there
is $c \in C$ with $p_c > 0$.

If $p_c = 1$ then $f$ is linear. The linear term also vanishes if in addition $e_{1,c} = 1$. In
this case $0 = e_{1,c} - 1 = \beta(\tau_c - 1)$, so $\beta = 0$. On the other hand, if $e_{1,c} \neq 1$ we can solve
(19) for $\beta$ and obtain the second case of (17).

Now assume $0 < p_c < 1$. Then $e_{2,c} - e_{1,c}^2 \ge 0$ by Proposition 2(iii) and

$$f(0) = -(p_c - e_{1,c})^2 - (e_{2,c} - e_{1,c}^2) \le 0$$
$$f(1) = e_{1,c} - e_{2,c} \ge 0$$

so there is one zero of $f$ in the interval $[0,1]$ and one in $]-\infty, 0]$. Solving (19) for $\beta$
and discarding the lower solution yields (17).

Finally, if $\tau_c = p_c$ we have $e_{1,c} = p_c \in\ ]0,1[$, hence $e_{1,c}(1 - e_{1,c}) \neq 0$ and

$$e_{2,c} - e_{1,c}^2 = \beta^2 p_c(1-p_c) = \beta^2 e_{1,c}(1-e_{1,c})$$

immediately shows (18). □
Assume that the population of items is a representative sample from the universe
of items and that the coders know about the distribution of categories ("base rate") in
the universe (such a situation seems not uncommon in medical or psychological
diagnostics). Then part (ii) provides a simple method to compute reliability. If the coders'
assumption on the base rate differs from the "true" category distribution, $\beta$ can be
computed from part (i).
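
Part (i) amounts to taking the larger root of the quadratic (19). The sketch below evaluates the three cases of (17) for a single category; the function name `beta_from_known_p` is an assumption, and the clamp on the discriminant is an addition meant for estimated inputs.

```python
import math

def beta_from_known_p(e1_c, e2_c, p_c):
    """Evaluate (17) for one category c with p_c > 0."""
    if p_c == 1:
        # f in (19) is linear (or vanishes identically if e1_c == 1)
        return 0.0 if e1_c == 1 else 1 - (e1_c - e2_c) / (1 - e1_c)
    h = 0.5 * ((1 - e1_c) / (1 - p_c) + e1_c / p_c)
    disc = h * h - (e1_c - e2_c) / (p_c * (1 - p_c))
    return 1 - h + math.sqrt(max(0.0, disc))   # larger root of (19)
```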
As was announced in the introduction, our model does not share the "annotator
bias" property, which is obvious from the definition of the model. It is also known that
$\kappa$ may be increased or decreased by combining categories (see [16]). Therefore it is
worth recording the following proposition, which shows that $\beta$ does not change when
combining categories or adding spurious ones.

Proposition 5 Let $(\beta, \gamma, p)$ be a coder model with coder process $X_k$. Let $C'$ be a finite
set and let $\Phi : C \to C'$ be some map. Let $X'_k := \Phi \circ X_k$, $\gamma' := \Phi \circ \gamma$ and for every $c' \in C'$
let $p'_{c'} := \sum_{c \in \Phi^{-1}(c')} p_c$ (with the understanding that $p'_{c'} = 0$ whenever $\Phi^{-1}(c') = \emptyset$).
Then $X'_k$ is a coder process for $(\beta, \gamma', p')$, i.e.

$$\mathbb{P}(X'_k = c') = \beta\,\delta_{c',\gamma'(k)} + (1-\beta)\,p'_{c'} \tag{20}$$

for all $k \in \mathbb{N}_N$, $c' \in C'$.

Note that the definition of $p'_{c'}$ in the proposition just defines the distribution of
$\Phi \circ Y_k$ on $C'$ with $Y_k$ from (1). Hence the proof is immediate from (2).
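
The transformation in Proposition 5 can be sketched as follows. The function name `merge_categories` and the encoding of $\Phi$ as a list of target indices are assumptions for illustration; $\beta$ does not appear because it is left unchanged.

```python
def merge_categories(gamma, p, phi, n_new):
    """Push a coder model through the map phi: C -> C' (Proposition 5).

    gamma -- true category index per item
    p     -- a priori distribution on C
    phi   -- list with phi[c] = index of the image category in C'
    n_new -- number of categories in C'
    Returns (gamma', p'); beta is unchanged.
    """
    gamma_new = [phi[c] for c in gamma]
    p_new = [0.0] * n_new                # empty preimages keep probability 0
    for c, q in enumerate(p):
        p_new[phi[c]] += q
    return gamma_new, p_new
```

Choosing `n_new` larger than the number of distinct values in `phi` corresponds to adding spurious categories with a priori probability zero.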

Now recall the discussion at the end of Section 2 and assume for a moment that $\beta$
would depend on $\gamma(k)$, so the original model would have the distribution

$$\mathbb{P}(X_k = c) = \beta_{\gamma(k)}\,\delta_{c,\gamma(k)} + (1 - \beta_{\gamma(k)})\,p_c$$

i.e. the mixture coefficient $\beta_{\gamma(k)}$ depends upon the support of the atomic measure.
Transforming the classes as in the preceding proposition, instead of (20) we would
arrive at the equation

$$\mathbb{P}(X'_k = c') = \beta_{\gamma(k)}\,\delta_{c',\gamma'(k)} + (1 - \beta_{\gamma(k)})\,p'_{c'}$$

so $\beta$ no longer depends upon the supporting element $\gamma'(k)$ of the atomic measure alone.

Proposition 5 provides a necessary condition for the validity of the model: if one
observes in an experiment that $\beta$ changes significantly when recomputed after
combining categories, the assumptions of the coder model are not met. The numerical
simulations in the following section may give some indication of which level of $\beta$-change
could be considered significant.

Now we state and prove the main result on the identification of $\beta$ in the general
case.

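Theorem 1 Let $(\beta, \gamma, p)$ be a coder model with coder processes $X_{i,k}$ for $i \in \mathbb{N}_R$, $k \in \mathbb{N}_N$.
Moreover let $C^* := \{c \in C : e_{2,c} \neq e_{1,c}^2\}$. If $\tau_c < 1$ for all $c \in C$, then the following holds:

(i) $C^* = \emptyset$ if and only if $\beta = 0$.

(ii) If $C^* \neq \emptyset$ then $\#C^* \ge 2$.

(iii) If $\#C^* = 2$ then $\beta = \sqrt{4a + b^2}$, where

$$a := e_{2,c} - e_{1,c}^2 \quad\text{and}\quad b := \frac{e_{3,c} - e_{1,c}^3}{e_{2,c} - e_{1,c}^2} - 3e_{1,c}$$

for some $c \in C^*$.

(iv) If $\#C^* \ge 3$ then

$$\beta = \frac{1}{\#C^* - 2}\left(\sum_{c \in C^*}\frac{e_{3,c} - e_{1,c}^3}{e_{2,c} - e_{1,c}^2} - 3 + 3\sum_{c \in C \setminus C^*} e_{1,c}\right) \tag{21}$$

(v) Let $C^* = \{c_1, \dots, c_{m^*}\}$ and assume $m^* \ge 3$. For $i, j \in \mathbb{N}_{m^*}$ let

$$\rho_{i,j} := \frac{e_{2,c_i,c_j} - e_{1,c_i}e_{1,c_j}}{e_{2,c_i} - e_{1,c_i}^2}.$$

Then $\lambda \in \mathbb{R}^{m^*}$ is a solution of

$$0 = \lambda_i\rho_{i,j} - \lambda_k\rho_{k,j} \text{ for all } i, j, k \in \mathbb{N}_{m^*} \text{ with } i \neq j \neq k \tag{22}$$

$$m^* - 1 = \sum_{i=1}^{m^*}\lambda_i \tag{23}$$

if and only if $\lambda_i = 1 - \tau_{c_i}$. Moreover, for the solution $\lambda$ the following holds:

$$\beta = \sqrt{\frac{\sum_{c \in C^*}(e_{2,c} - e_{1,c}^2)}{1 - \sum_{i=1}^{m^*}(1-\lambda_i)^2}} \tag{24}$$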

Proof Part (i) is just a restatement of Proposition 2(ii).

By (i), $C^* \neq \emptyset$ implies $\beta \neq 0$ and by (11) $\tau_{c_0} > 0$ for some $c_0 \in C^*$. Since $\tau_{c_0} < 1$
and $\sum_{c \in C}\tau_c = 1$ there is $c_1 \in C$, $c_1 \neq c_0$ with $\tau_{c_1} > 0$ and again by (11) $e_{2,c_1} - e_{1,c_1}^2 > 0$,
so $c_1 \in C^*$, proving (ii).

To prove part (iii) write $C^* = \{c_0, c_1\}$ and $\tau_i := \tau_{c_i}$. From (6) we see that for every $c \in C$

$$e_{3,c} - e_{1,c}^3 = \beta^2\tau_c(1-\tau_c)\big(\beta(1+\tau_c) + 3(1-\beta)p_c\big)
= (e_{2,c} - e_{1,c}^2)\big(\beta(1+\tau_c) + 3(1-\beta)p_c\big) \tag{25}$$

using (11) above in the last step. From (11) and (25) we obtain $a = \beta^2\tau_c(1-\tau_c) \ge 0$
and $b = \beta(1 - 2\tau_c)$, where $\tau_c = \tau_0$ or $\tau_c = \tau_1$. Since $\tau_0 + \tau_1 = 1$ we see that $a$ is
independent of $c$ and that $b$ is uniquely defined up to its sign. So $\sqrt{4a + b^2}$ is well defined
and independent of the choice of $c$ in the definition of $a$ and $b$. Now

$$(1 - 2\tau_c)^2 a = (1 - 2\tau_c)^2\beta^2\tau_c(1-\tau_c) = b^2\tau_c(1-\tau_c)$$

and thus

$$(4a + b^2)\Big(\tau_c - \frac{1}{2}\Big)^2 = \frac{b^2}{4}. \tag{26}$$

Hence $4a + b^2 = 0$ implies $b^2 = 0$ and so $a = 0$. Now $b = 0$ if and only if $\beta = 0$ or
$\tau_c = \frac{1}{2}$, and $a = 0$ if and only if $\beta = 0$ or $\tau_c \in \{0, 1\}$, which shows that $4a + b^2 = 0$ if
and only if $\beta = 0$.
On the other hand, if $4a + b^2 \neq 0$ we may solve (26) for $\tau_c$ and obtain

$$\tau_c = \frac{1}{2}\left(1 \pm \frac{b}{\sqrt{4a + b^2}}\right)$$

and using the definition of $b$ (and that $\beta > 0$) (iii) is proved.
Now we prove (iv). From (25) we obtain for each $c \in C^*$

$$\frac{e_{3,c} - e_{1,c}^3}{e_{2,c} - e_{1,c}^2} = \beta(1 + \tau_c) + 3(1-\beta)p_c.$$

Now, since $C^* \neq \emptyset$, by part (i) $\beta \neq 0$. Hence for all $c \in C \setminus C^*$ we get $\tau_c = 0$ by
Proposition 2(i). Thus

$$\sum_{c \in C^*}\tau_c = \sum_{c \in C}\tau_c = 1$$

and $e_{1,c} = (1-\beta)p_c$ for $c \in C \setminus C^*$. This shows

$$\sum_{c \in C^*}\frac{e_{3,c} - e_{1,c}^3}{e_{2,c} - e_{1,c}^2} = \beta\sum_{c \in C^*}(1 + \tau_c) + 3(1-\beta)\sum_{c \in C^*} p_c$$

$$= \beta(\#C^* + 1) + 3(1-\beta)\Big(1 - \sum_{c \in C \setminus C^*} p_c\Big)$$

$$= \beta(\#C^* - 2) + 3 - 3\sum_{c \in C \setminus C^*} e_{1,c}.$$

Since $\#C^* \neq 2$ we may solve for $\beta$, which finishes the proof of (iv).

Next we prove (v). As in the proof of part (iv), $\beta \neq 0$ and $\tau_c = 0$ if and only if
$c \in C \setminus C^*$.

Now write $\tau_i := \tau_{c_i}$ and combine (11), (3), and (4) to see that

$$\rho_{i,j} = \frac{\delta_{i,j}\tau_i - \tau_i\tau_j}{\tau_i(1-\tau_i)} \tag{27}$$

so the proof of the "if"-part is obvious.

Now assume that some $\lambda \in \mathbb{R}^{m^*}$ solves (22) and that $\sum_{i=1}^{m^*}\lambda_i = m^* - 1$. Since $\tau_i < 1$
for all $i \in \mathbb{N}_{m^*}$ and $\sum_{i=1}^{m^*}\tau_i = 1$ there are $i, k \in \mathbb{N}_{m^*}$, $i \neq k$ such that $\tau_i \neq 0 \neq \tau_k$.
So for all $j \in \mathbb{N}_{m^*} \setminus \{i, k\}$

$$\frac{\lambda_i}{1-\tau_i} = -\frac{\lambda_i\rho_{i,j}}{\tau_j} = -\frac{\lambda_k\rho_{k,j}}{\tau_j} = \frac{\lambda_k}{1-\tau_k}$$

and

$$\frac{\lambda_j}{1-\tau_j} = -\frac{\lambda_j\rho_{j,k}}{\tau_k} = -\frac{\lambda_i\rho_{i,k}}{\tau_k} = \frac{\lambda_i}{1-\tau_i}.$$

Since $m^* \ge 3$ the set $\mathbb{N}_{m^*} \setminus \{i, k\}$ is not empty and thus $\theta := \frac{\lambda_i}{1-\tau_i} = \frac{\lambda_j}{1-\tau_j}$ holds for all
$j \in \mathbb{N}_{m^*}$. This shows that $\lambda_j = (1 - \tau_j)\theta$ for all $j \in \mathbb{N}_{m^*}$. Now

$$m^* - 1 = \sum_{i=1}^{m^*}\lambda_i = \theta(m^* - 1)$$

implies $\theta = 1$, so $\lambda_i = 1 - \tau_i$. Finally, since

$$\sum_{c \in C^*}(e_{2,c} - e_{1,c}^2) = \beta^2\sum_{c \in C^*}\tau_c(1-\tau_c) = \beta^2\Big(1 - \sum_{c \in C^*}\tau_c^2\Big)
= \beta^2\Big(1 - \sum_{i=1}^{m^*}(1-\lambda_i)^2\Big)$$

and since, because $m^* > 0$, the left hand side is positive, we may solve for $\beta$ and
obtain (24). □

Using that $\tau_j \neq 0$ for $j \in \mathbb{N}_{m^*}$ we see from (27) that $\rho_{i,j} \neq 0$ for $i, j \in \mathbb{N}_{m^*}$, so
(22) can easily be solved by forward substitution. Experience shows, however, that
computing $\beta$ according to part (v) is numerically unstable. Its virtue lies in the fact that
it shows that $\beta$ is uniquely determined by double coincidence expectations $e_{2,c,c'}$, which
could be estimated from the ratings of two coders, but only if $m \ge m^* \ge 3$, i.e. if there
are at least three categories.

Parts (iv) and (iii) use the triple coincidence expectations, which require the ratings
of at least three raters but are applicable for all $m \ge 2$.

This raises the question of whether $\beta$ is uniquely determined given double coincidence
expectation values even in the case $m = 2$. The next proposition shows that this
is not the case.
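
Parts (i), (iii) and (iv) of Theorem 1 can be combined into one routine that maps expectation values to $\beta$. The following is a minimal sketch, not necessarily the algorithm evaluated in Section 4: the function name `beta_from_triples` and the tolerance `tol` used to decide membership in $C^*$ are assumptions.

```python
import math

def beta_from_triples(e1, e2, e3, tol=1e-12):
    """Compute beta from per-category lists e1, e2, e3 (Theorem 1 (i), (iii), (iv))."""
    m = len(e1)
    star = [c for c in range(m) if abs(e2[c] - e1[c] ** 2) > tol]   # C*
    if not star:
        return 0.0                                   # part (i)
    if len(star) == 1:
        raise ValueError("#C* = 1 contradicts part (ii); check inputs or tol")
    ratio = {c: (e3[c] - e1[c] ** 3) / (e2[c] - e1[c] ** 2) for c in star}
    if len(star) == 2:                               # part (iii)
        c = star[0]
        a = e2[c] - e1[c] ** 2
        b = ratio[c] - 3 * e1[c]
        return math.sqrt(4 * a + b * b)
    rest = sum(e1[c] for c in range(m) if c not in star)
    return (sum(ratio.values()) - 3 + 3 * rest) / (len(star) - 2)   # part (iv), eq. (21)
```

With estimated rather than exact expectation values, `tol` would have to be chosen well above the sampling noise; that choice is outside the scope of this sketch.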

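Proposition 6 Let $m = 2$ and assume that $\tau_c < 1$ for $c \in C$. Let $(\beta, \gamma, p)$ be a coder
model with expectation values $e_{1,c_1}$, $e_{2,c_1,c_2}$, $c_1, c_2 \in C$, let $e_1 := \max(e_{1,c_1}, e_{1,c_2})$ and

$$I := \begin{cases}
[0, 1], & \text{if } e_1 = 1 \\[4pt]
[0, 2(1-e_1)] \cup \left[1 - e_1 + \dfrac{\beta^2\tau_{c_1}(1-\tau_{c_1})}{1-e_1},\ 1\right], & \text{if } e_1 < 1.
\end{cases} \tag{28}$$

Then if

$$\beta' \in \left[2\beta\sqrt{\tau_{c_1}(1-\tau_{c_1})},\ e_1 + \frac{\beta^2\tau_{c_1}(1-\tau_{c_1})}{e_1}\right] \cap I \tag{29}$$

and $\beta'^2 = \dfrac{4\beta^2\tau_{c_1}(1-\tau_{c_1})}{1 - (n/N)^2}$ for some $n \in \mathbb{N}_N$ such that $n + N \in 2\mathbb{N}$, then there is a coder
model $(\beta', \gamma', p')$, where $\gamma' : \mathbb{N}_N \to C$, which yields the same expectation values $e_{1,c_1}$,
$e_{2,c_1,c_2}$, $c_1, c_2 \in C$.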

Proo/ Write C = {ci,c 2 }. First observe that e\ Cl — 1 — ci, ci . e 2. ci ,c 2 — e l,ci — 
e %c\c\ ar, d thus 

^2,C2,C2 = ^2,C2,ci = 1 2^i )C j + ^2,ci,ci* 

So we only need to show <?j t[ = <?j and e2 jCljCl = e 2ci e , f° r tne corresponding ex- 
pectations <?2 C] C] , c'i Cl of the model (j8', Y,p'.)- 

Let co G C be such that e CQ = e\. Since ei )C] +ei, = 1 we have ei = e CQ > j . We 
also abbreviate e 2 := e2,c ,co> e 'i : = e' , ande' 2 := e 2 ,c ,c > T = Tc , P = />c , 

Observe also that z c < 1 for all c 6 C implies < T c < 1 for all c € C and that all 
conditions in the statement of the proposition are invariant if t Ci is exchanged for T. 
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Thus if j3' = also j3 = and the statement of the proposition is trivially satisfied in this 
case. So for the following we may assume that J3' >O.By£Be 2 -ei = J3 2 t(1 - t) 
and by assumption j3' > 2/3 \/x(l — x), so 



is well defined and satisfies 

j3' 2 T'(l-T')=i8 2 T(l-T) 

Hence 

4 = f5' 2 x'(l - X 1 ) +e[ 2 = j3 2 T(l - x) + e\ ~ =e 2 + (e'\ - e\), 
so we only need to prove 

ei =e[=P'x' + (l-p')p' (31) 
Case j3' = 1: In this case from the assumption we see that 

1 = p <cH : =e\-\ = — < 1 

el e\ e\ 

(using <?2 < LcGC e 2,c .c = e i)' so e 2 = e i- Now from (0 and © 

= e 2 -e l = -(l-P)(Px(l-p) + Pp(l-x) + (l-P)p(l-p)) (32) 

Since every summand in the second factor of d32b is non-negative and < x < 1 this 
implies that either /3 = and /? e {0, 1} or j3 = 1. 

First assume /3 = 0. Since by assumption 0< I < ei = j3r+(l— j3)/? only p = 1 
is possible. Now from (f30b we obtain t' = 1 and from p = 1 

ei = /3t+ (i - p> = = i = /3 v = e ; 

On the other hand, if j3 = 1 we get x = e\ > \ and so 

e\ = X 1 = ^+ l -^J I -Ax{\ - x) = X = e x . 

This concludes the case j8' = 1. 
If /3' < 1 we may solve 

ei =/3V + (l-p>' 

for // and it remains to show that < p' < 1 . 
From the assumption

\[ e_1\beta' \le e_1^2 + \beta^2\tau(1-\tau) \]

and after reordering and completing the square we find that

\[ \sqrt{\beta'^2 - 4\beta^2\tau(1-\tau)} \le |2e_1 - \beta'| = 2e_1 - \beta', \]

where the last equality follows from $\beta' \le 1 \le 2e_1$. This shows that

\[ \beta'\tau' = \frac{\beta'}{2} + \frac{1}{2}\sqrt{\beta'^2 - 4\beta^2\tau(1-\tau)} \le e_1 \]

and thus $p' \ge 0$.

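To prove $p' \le 1$ first assume $e_1 = 1$. Then

\[ 0 = 1 - e_1 = \beta(1-\tau_{c_0}) + (1-\beta)(1-p_{c_0}) \]

and since $\tau_{c_0} < 1$ we conclude that $\beta = 0$ and $p_{c_0} = 1$. This implies $\tau' = 1$ and so

\[ p' = \frac{e_1 - \beta'\tau'}{1-\beta'} = \frac{1-\beta'}{1-\beta'} = 1. \]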

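If $e_1 < 1$, by assumption, $\beta' \le 2(1-e_1)$ or $\frac{\beta^2\tau(1-\tau)}{1-e_1} + 1 - e_1 \le \beta'$,
so again by reordering and completion of the square one sees that

\[ \beta' - 2(1-e_1) \le 0 \quad\text{or}\quad |\beta' - 2(1-e_1)| \le \sqrt{\beta'^2 - 4\beta^2\tau(1-\tau)}, \]

i.e. $\beta' - 2(1-e_1) \le \sqrt{\beta'^2 - 4\beta^2\tau(1-\tau)}$ and thus

\[ e_1 - \beta'\tau' = e_1 - \frac{\beta'}{2} - \frac{1}{2}\sqrt{\beta'^2 - 4\beta^2\tau(1-\tau)} \le 1 - \beta', \]

proving $p' \le 1$.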



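Finally, let $\beta'^2 = \dfrac{4\beta^2\tau(1-\tau)}{1 - n^2/N^2}$ for some $n \in \mathbb{N}_N$ such that
$n + N \in 2\mathbb{N}$. Then by (30)

\[ \tau' = \frac{1}{2} + \frac{1}{2}\sqrt{1 - \Bigl(1 - \frac{n^2}{N^2}\Bigr)} = \frac{1}{2} + \frac{n}{2N}. \]

So $\tau' N$ is a natural number and e.g. defining $\gamma'(k) = c_1$ for $k \le \tau' N$ and $\gamma'(k) = c_2$
otherwise, completes the proof. □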
If in the preceding proposition $N$ is large enough several points of the set

\[ \Bigl\{ \sqrt{\frac{4\beta^2\tau(1-\tau)}{1 - n^2/N^2}} \;:\; n \in \mathbb{N}_N,\ n + N \in 2\mathbb{N} \Bigr\} \]

(where $0 < \tau < 1$) fall into the set

\[ \Bigl[\, 2\beta\sqrt{\tau(1-\tau)},\ e_1 + \frac{\beta^2\tau(1-\tau)}{e_1} \Bigr] \]

(if it has inner points) so the reliability parameter cannot be determined uniquely.
Figure 1 shows the $\beta'$-range given by (29) for some random example.

So in the two-category case we need an estimate of $e_{3,c}$ in order to apply the corresponding
part of Theorem 1, i.e. we need at least three coders to determine $\beta$ in this case.

As a consequence, for the popular two-coder/two-category examples $\beta$ (more precisely
the triple $(\beta, \tau, p)$) is not uniquely determined. In order to determine $\beta$ in such
a situation we thus either need to know $p$ and use Proposition 4, or know $\tau$ and apply
Proposition 2.
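
The non-uniqueness in the two-coder/two-category case can be checked numerically. The following is a minimal sketch, assuming the per-category formulas $e_1 = \beta\tau + (1-\beta)p$ and $e_2 = \beta^2\tau + 2\beta(1-\beta)p\tau + (1-\beta)^2p^2$ and the construction of $\tau'$ and $p'$ used in the proof above. The function names and the value $\beta = 0.5$ are illustrative choices, not taken from the paper.

```python
import math

def pair_moments(beta, tau, p):
    """e_1 and e_2 of one category (assumed model formulas)."""
    e1 = beta * tau + (1 - beta) * p
    e2 = beta**2 * tau + 2 * beta * (1 - beta) * p * tau + (1 - beta)**2 * p**2
    return e1, e2

def alternative_model(beta, tau, p, beta_alt):
    """Return (tau', p') giving the same e_1, e_2 under beta_alt, or None."""
    if not 0 < beta_alt < 1:
        return None  # boundary cases are treated separately in the proof
    e1, _ = pair_moments(beta, tau, p)
    disc = 1 - 4 * beta**2 * tau * (1 - tau) / beta_alt**2
    if disc < 0:
        return None  # beta_alt lies below 2*beta*sqrt(tau*(1-tau))
    tau_alt = 0.5 + 0.5 * math.sqrt(disc)
    p_alt = (e1 - beta_alt * tau_alt) / (1 - beta_alt)
    if not 0 <= p_alt <= 1:
        return None  # p' would not be a probability
    return tau_alt, p_alt

if __name__ == "__main__":
    beta, tau, p = 0.5, 0.7, 0.6  # illustrative values for the category with e_1 >= 1/2
    print("original model:", pair_moments(beta, tau, p))
    for beta_alt in (0.6, 0.7):
        alt = alternative_model(beta, tau, p, beta_alt)
        if alt is not None:
            print(beta_alt, alt, pair_moments(beta_alt, *alt))
```

Every admissible `beta_alt` reproduces the same pair `(e_1, e_2)`, so observations from two coders alone cannot distinguish these parameter triples.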
Figure 1: Example of the $\beta'$-region (shaded) according to (29) of Proposition 6. The
straight line inside the region indicates the value of $\beta$, i.e. the diagonal. Horizontal
axis: beta. Parameters are $\tau = (0.7, 0.3)$, $p = (0.6, 0.4)$.

4 Numerical Simulations 

The formulae provided by Theorem 1 involve expectation values of coder agreement
frequencies. In experiments we typically do not know expectation values but rather
observe relative frequencies. Hence we will not obtain the correct values for $\beta$ using
the formulae in parts (iii) and (iv) of Theorem 1. Actually, these formulae involve
differences of expectation values which are close to 0 for small values of $\beta$, so small
statistical fluctuations might lead to large deviations in $\beta$. Thus in order to improve the
accuracy we reformulate the problem as a least squares optimization problem for the
expectation values $e_{1,c}$, $e_{2,c,c}$ and (if $\#C > 2$) $e_{3,c}$, using the formulae for $\beta$ to obtain
a start value (augmented by approximations for $\tau$ and $p$ according to (??) and (??)
respectively). So find $\beta$, $\tau$, $p$ such that

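\[
\begin{aligned}
&\sum_{c \in C} \bigl(\beta\tau_c + (1-\beta)p_c - e_{1,c}\bigr)^2 \\
&\quad + \sum_{c \in C} \bigl(\beta^2\tau_c + 2\beta(1-\beta)p_c\tau_c + (1-\beta)^2p_c^2 - e_{2,c,c}\bigr)^2 \\
&\quad + \sum_{c \in C} \bigl(\beta^3\tau_c + 3\beta^2(1-\beta)p_c\tau_c + 3\beta(1-\beta)^2p_c^2\tau_c + (1-\beta)^3p_c^3 - e_{3,c}\bigr)^2
\end{aligned}
\]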

is minimized, subject to the natural constraints. 
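
The objective above can be written down directly. The following is a minimal sketch, assuming that the second sum ranges over the same-category pair agreements $e_{2,c,c}$. The function names are hypothetical, and the one-dimensional grid search over $\beta$ with $\tau$ and $p$ held fixed only stands in for the constrained optimization over all of $\beta$, $\tau$, $p$ described in the text.

```python
def expected_moments(beta, tau, p):
    """Model expectations e_{1,c}, e_{2,c,c}, e_{3,c} for every category c."""
    b, a = beta, 1 - beta
    e1 = [b * t + a * q for t, q in zip(tau, p)]
    e2 = [b**2 * t + 2 * b * a * q * t + a**2 * q**2 for t, q in zip(tau, p)]
    e3 = [b**3 * t + 3 * b**2 * a * q * t + 3 * b * a**2 * q**2 * t + a**3 * q**3
          for t, q in zip(tau, p)]
    return e1, e2, e3

def objective(beta, tau, p, observed):
    """Sum of squared deviations between model moments and observed frequencies."""
    model = expected_moments(beta, tau, p)
    return sum((m - o) ** 2
               for model_vec, obs_vec in zip(model, observed)
               for m, o in zip(model_vec, obs_vec))

def grid_search_beta(tau, p, observed, steps=100):
    """Crude stand-in for the optimizer: scan beta on a grid, tau and p fixed."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda b: objective(b, tau, p, observed))

if __name__ == "__main__":
    tau = [0.3, 0.6, 0.1]
    p = [0.33, 0.33, 0.34]
    observed = expected_moments(0.85, tau, p)  # noise-free "observations"
    print(grid_search_beta(tau, p, observed))
```

In practice the observed vectors would be relative frequencies from the coding experiment, and a constrained least squares solver would vary $\tau$ and $p$ as well, keeping all parameters in $[0, 1]$ and both $\tau$ and $p$ summing to one.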

As we already noted in the introduction, the model-based approach chosen here
allows for simulation runs to investigate the accuracy of this algorithm. The remainder
of this section is devoted to such numerical experiments, which show the accuracy under
varying model parameters. We display the results as inverse empirical distribution
functions for a sample of 1000 randomly chosen realizations of the coder model, so
the abscissae contain the quantiles and the ordinates the absolute errors (observe the
ranges). In the plots the values for the 50 %, 80 %, 90 %, 95 %, 98 %, and 100 %
quantiles are highlighted. For every plot we also indicate the other parameters in the
caption. The meaning of the parameters is that of the coder model defined in Section 2.

Figure 2: Estimation errors as a function of the true $\beta$ value. Fixed parameters:
$N = 100$, $m = 3$, $R = 5$, $\tau = (0.3, 0.6, 0.1)$, $p = (0.33, 0.33, 0.34)$
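
One realization of such a simulation run can be sketched as follows. This is a minimal sketch under an assumed reading of the coder model: each coder independently assigns the true category of an item with probability $\beta$ and otherwise draws a category from $p$, which is consistent with the moment formulas used in this section. The function names and the unbiased pair and triple frequency estimates are illustrative assumptions, and at least three coders are required.

```python
import random

def simulate_ratings(gamma, n_coders, beta, p, rng):
    """ratings[item][coder] for true categories gamma under the assumed model."""
    cats = list(range(len(p)))
    return [[g if rng.random() < beta else rng.choices(cats, weights=p)[0]
             for _ in range(n_coders)]
            for g in gamma]

def empirical_moments(ratings, n_cats):
    """Relative frequencies estimating e_{1,c}, e_{2,c,c}, e_{3,c} (needs R >= 3)."""
    n_items, r = len(ratings), len(ratings[0])
    e1, e2, e3 = [0.0] * n_cats, [0.0] * n_cats, [0.0] * n_cats
    for row in ratings:
        for c in range(n_cats):
            k = row.count(c)  # number of coders who chose category c
            e1[c] += k / r
            e2[c] += k * (k - 1) / (r * (r - 1))
            e3[c] += k * (k - 1) * (k - 2) / (r * (r - 1) * (r - 2))
    return ([x / n_items for x in e1],
            [x / n_items for x in e2],
            [x / n_items for x in e3])

if __name__ == "__main__":
    rng = random.Random(1)                    # fixed seed for reproducibility
    gamma = [0] * 30 + [1] * 60 + [2] * 10    # N = 100, tau = (0.3, 0.6, 0.1)
    ratings = simulate_ratings(gamma, 5, 0.85, [0.33, 0.33, 0.34], rng)
    print(empirical_moments(ratings, 3))
```

Repeating this for many seeds and feeding the resulting frequencies to the estimator gives the sample of absolute errors whose distribution the plots display.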

The accuracy of the $\beta$ estimate depends on the actual value of $\beta$. As Figure 2 shows,
the error decreases with increasing true value of $\beta$. The 98% quantile accuracy ranges
from 0.032 at $\beta_{\mathrm{true}} = 0.95$ to 0.105 at $\beta_{\mathrm{true}} = 0.5$.
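
Quantile figures of this kind are read off an inverse empirical distribution function of the absolute errors. A minimal sketch of such a function follows. The quantile convention (the smallest sample value that covers at least a fraction $q$ of the sample) and the function name are assumptions, since the text does not state which convention was used.

```python
import math

def error_quantile(errors, q):
    """Smallest absolute error covering at least a fraction q of the sample."""
    if not errors or not 0 < q <= 1:
        raise ValueError("need a non-empty sample and 0 < q <= 1")
    ordered = sorted(abs(e) for e in errors)
    index = math.ceil(q * len(ordered)) - 1  # position of the q-quantile
    return ordered[index]

if __name__ == "__main__":
    sample = [0.1, -0.4, 0.2, 0.3]  # toy estimation errors (beta_hat - beta_true)
    for q in (0.5, 0.75, 1.0):
        print(q, error_quantile(sample, q))
```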

According to the coder model a value of $\beta = 0.5$ means that only for half of the
items the raters could determine the categories with certainty. Note also that if the 
assumptions of Proposition [3] are satisfied the 5-value would be as low as 0.25 in this 
case. Hence the really interesting range for $\beta$ is above 0.5 where the accuracy is higher.

As has been noted before, the definition of $\beta$ does not exhibit the "problem of
prevalence", i.e. its value does not decrease when $\max(\tau)$ approaches 1. Though the
value of $\beta$ is not affected, prevalence does affect the accuracy, as Figure 3 shows. The 98%
quantile accuracy ranges from 0.032 for $\max(\tau) = \frac{1}{3}$ (the least value of $\max(\tau)$ in
this setting) to 0.077 for $\max(\tau) = 0.90$ and 0.22 for $\max(\tau) = 0.95$. In this latter
case only five of the items do not belong to the prevalent category. Hence
statistical fluctuations may blur the distinction of this case from the case $\max(\tau) = 1$,
where $\beta$ can no longer be determined according to Proposition 1. So this decrease in
accuracy is expected.

Contrary to the rather strong impact of $\tau$ on the accuracy, the a priori distribution
$p$ does not significantly influence the accuracy, as Figure 4 shows. Here
the 98% quantile errors range from 0.049 to 0.058, which may be fully attributed to
statistical fluctuations.

Figure 3: Error of $\beta$ estimate for different true class frequencies. Fixed parameters:
$N = 100$, $m = 3$, $R = 5$, $\beta = 0.85$, $p = (0.33, 0.33, 0.34)$

Figure 4: Error of $\beta$ estimate for different a priori distributions. Fixed parameters:
$N = 100$, $m = 3$, $R = 5$, $\beta = 0.85$, $\tau = (0.3, 0.6, 0.1)$

Figure 5: Error of $\beta$ estimate for different number of coders. Fixed parameters:
$N = 100$, $m = 3$, $\beta = 0.85$, $\tau = (0.3, 0.6, 0.1)$, $p = (0.33, 0.33, 0.34)$

Figure 6: Error of $\beta$ estimate for different number of items. Fixed parameters:
$m = 3$, $R = 5$, $\beta = 0.85$, $\tau = (0.3, 0.6, 0.1)$, $p = (0.33, 0.33, 0.34)$

Finally, since the errors originate from deviations of relative frequencies from the
expectation values, the accuracy depends of course moderately on both the number of
coders and the number of items. Figure 5 shows the influence of the number of coders
on the accuracy. At the 98% quantile level the errors range from 0.03 (15 coders) to
0.07 (3 coders). Actually as few as five coders suffice to obtain a reasonably low error
of 0.053.

The impact of the number of items can be seen from Figure 6. With as few as
20 items one cannot expect more than a rough estimate of $\beta$ (error 0.115 at the 98%
quantile), whereas coding 100 items yields a reasonably low error of 0.054.